Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libct/cg/sd: fix dbus connection leak (alternative) #2937

Merged
merged 3 commits into from
May 7, 2021

Conversation

kolyshkin
Copy link
Contributor

@kolyshkin kolyshkin commented May 5, 2021

TL;DR: fixes a regression (open fd leak) caused by PR #2923. This is an alternative to #2936.

When running libcontainer/integration/TestSystemdFreeze 2000 times,
I got "too many open files" error after a while:

[vagrant@localhost integration]$ sudo -E PATH=$PATH ./integration.test -test.v -test.run Freeze -test.count 2000
<...>
=== RUN   TestFreeze
--- PASS: TestFreeze (0.28s)
=== RUN   TestSystemdFreeze
--- PASS: TestSystemdFreeze (0.42s)
=== RUN   TestFreeze
    utils_test.go:86: exec_test.go:536: unexpected error: container_linux.go:380: starting container process caused: process_linux.go:338: starting init process command caused: fork/exec /proc/self/exe: too many open files
        
--- FAIL: TestFreeze (0.16s)
=== RUN   TestSystemdFreeze
    utils_test.go:86: exec_test.go:536: unexpected error: container_linux.go:380: starting container process caused: process_linux.go:338: starting init process command caused: fork/exec /proc/self/exe: too many open files

<...>

--- FAIL: TestFreeze (0.00s)
=== RUN   TestSystemdFreeze
    utils_test.go:86: exec_test.go:536: unexpected error: container_linux.go:380: starting container process caused: process_linux.go:338: starting init process command caused: open /dev/null: too many open files
        
--- FAIL: TestSystemdFreeze (0.16s)
=== RUN   TestFreeze
    utils_test.go:86: exec_test.go:512: unexpected error: untar error "fork/exec /bin/sh: too many open files": ""
        
--- FAIL: TestFreeze (0.00s)
=== RUN   TestSystemdFreeze
    utils_test.go:86: exec_test.go:536: unexpected error: container_linux.go:364: creating new parent process caused: container_linux.go:496: including execfifo in cmd.Exec setup caused: open /tmp/libcontainer548959440/test/exec.fifo: too many open files
       
 --- FAIL: TestSystemdFreeze (0.17s)
=== RUN   TestFreeze
    utils_test.go:86: exec_test.go:512: unexpected error: untar error "open /dev/null: too many open files": ""
        
<...>

Indeed, systemd cgroup manager never closes its dbus connection.

This was not a problem before PR #2923 because we only had a single connection
for the whole runtime. Now it is per manager instance, so we leak a
connection (apparently two sockets) per cgroup manager instance.

The fix is to go back to using a single global dbus connection.
UPDATE this is now done to minimize the diff size.

This also makes it impossible to use both rootful and rootless dbus connections
at the same time. From what I understand, no runtime ever wants or needs to
do that, so it's not an issue in practice. An assertion is added to make sure this
never happens.

A test case is added in a separate commit. Before the fix, it always fails
like this:

        exec_test.go:2030: extra fd 8 -> socket:[659703]
        exec_test.go:2030: extra fd 11 -> socket:[658715]
        exec_test.go:2033: found 2 extra fds after container.Run
    --- FAIL: TestFdLeaksSystemd (0.20s)

@kolyshkin
Copy link
Contributor Author

Rebased (minor conflicts in libct/int), addressed a review nit.

@kolyshkin
Copy link
Contributor Author

CI failure is a known flake (#2907) which is appearing a lot lately.

@kolyshkin kolyshkin force-pushed the dbus-cm-leak-alt branch 2 times, most recently from d0c41d4 to e18bc8b Compare May 6, 2021 19:18
@kolyshkin
Copy link
Contributor Author

I have rewritten this PR to be less intrusive (don't change any of the callers). PTAL @cyphar @AkihiroSuda @mrunalp

Commit 47ef9a1 forgot to wrap GetManagerProperty("ControlGroup")
into retryOnDisconnect. Since there's one other user of
GetManagerProperty, add getManagerProperty wrapper and use it.

Fixes: 47ef9a1

Signed-off-by: Kir Kolyshkin <[email protected]>
Using per cgroup manager dbus connection instances means
that every cgroup manager instance gets a new connection,
and those connections are never closed, ultimately resulting
in file descriptors limit being hit.

Revert back to using a single global dbus connection for everything,
without changing the callers.

NOTE that it is assumed a runtime can't use both root and rootless
dbus at the same time. If this happens, we panic.

Signed-off-by: Kir Kolyshkin <[email protected]>
Add a test to check that container.Run do not leak file descriptors.

Before the previous commit, it fails like this:

    exec_test.go:2030: extra fd 8 -> socket:[659703]
    exec_test.go:2030: extra fd 11 -> socket:[658715]
    exec_test.go:2033: found 2 extra fds after container.Run

Signed-off-by: Kir Kolyshkin <[email protected]>
@kolyshkin
Copy link
Contributor Author

Rebased to include #2941 to fix CI failures.

@AkihiroSuda
Copy link
Member

@kolyshkin Do you want #2936 or this one?

@kolyshkin
Copy link
Contributor Author

Do you want #2936 or this one?

Good question. @cyphar thinks #2936 is better (but he had not seen the last iteration of this one). @mrunalp thinks this one is better. I slightly favor this one -- it is less elegant than #2936 but practically makes more sense (why do we need more than 1 connection anyway?).

@AkihiroSuda
Copy link
Member

Merging. We can revisit #2936 later if it turned out to be better.

@AkihiroSuda AkihiroSuda merged commit f577072 into opencontainers:master May 7, 2021
@hqhq hqhq mentioned this pull request May 8, 2021
7 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants